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Summary: Comparing large covariance matrices has important applications in modern genomics, where scientists 
are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large 
number of genes vary between different biological states. We propose a computationally fast procedure for testing 
the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than 
the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the 
unknown covariance matrices. Hence the test is robust with respect to various complex dependence structures that 
frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment 
conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice 
property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene 
expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different 
gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides 
new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been 
implemented in an R-package HDtest and is available on CRAN. 

Key WORDS: Differential expression analysis; Gene clustering; High dimension; Hypothesis testing; Parametric 
bootstrap; Sparsity. 
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1. Introduction 

The problem of comparing two large population covariance matrices has important appli¬ 
cations in modern genomics, where growing attentions have been devoted to understanding 
how the relationship (e.g. dependencies or co-regulations) among genes vary between different 
biological states. Our interest in this problem is motivated by a microarray study on human 
asthma (Voraphani et ah, 2014). This study consists of 88 asthma patients and 20 controls. 
It is known that genes tend to work collectively in groups to achieve certain biological tasks. 
Our analysis focuses on such groups of genes (gene sets) defined with the gene ontology 
(GO) framework, which are referred to as GO terms. Identifying GO terms with altered 
dependence structures between disease and control groups provides critical information on 
differential gene pathways associated with asthma. Many of the GO terms contain a large 
number of (in the asthma data, as many as 8,070) genes. The large dimension of microarray 
data and the complex dependence structure among genes make the problem of comparing 
two population matrices extremely challenging. 

In conventional multivariate analysis where the dimension p is fixed, testing the equality 
of two unknown covariance matrices Si and S 2 based on the samples with sample sizes n 
and m has been extensively studied, see for example Anderson (2003) and the references 
therein. In the high-dimensional setting where p > ma x(n,m), recently several authors have 
developed new tests other than the traditional likelihood ratio test. Considering multivariate 
normal data, Schott (2007) and Srivastava and Yanagihara (2010) constructed tests using 
different distances based on traces of the covariance matrices; Li and Chen (2012) proposed 
a [/-statistic based test for a more general multivariate model. These tests are effective for 
dense alternatives, but often suffer from low power when — X 2 is sparse. We are more 
interested in this latter situation, as in genomics the difference in the dependence structures 
between populations typically involves only a small number of genes. 
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For sparse alternatives, Cai et al. (2013) investigated an Loo-type test. They proved that 
the distribution of the test statistic converges to a type I extreme value distribution under the 
null hypothesis and the test enjoys certain optimality property. Motivated by this work, we 
propose in this paper a perturbed variation of the L^-type test statistic. We verify that the 
conditional distribution of the perturbed Loo-statistic provides a high-quality approximation 
to the distribution of the original L^-type test, which has important implications in achieving 
accurate performance in finite sample size. In contrast, the convergence rate to the extreme- 
value distribution of type I is of order 0{log(logn)/logn} (Liu et ah, 2008). 

The asymptotic validity of our proposed new procedure does not require any structural 
assumptions on the unknown covariances. It is valid under weak moment conditions. On the 
other hand, the aforementioned work all require certain parametric distributional assump¬ 
tions or structural assumptions on the population covariances in order to derive an asymp¬ 
totically pivotal distribution. Assumptions of this kind are not only difficult to be verified but 
also often violated in real data. It is known that expression levels of the genes regulated by the 
same pathway (Wolen and Miles, 2012) or associated with the same functionality (Katsani et 
ah, 2014) are often highly correlated. Also, in the microarray and sequencing experiments, 
most genes are expressed at very low levels while few are expressed at high levels. This 
implies that the distribution of gene expressions is most likely heavy-tailed regardless of the 
normalization and transformations (Wang et ah, 2015). 

For testing H 0 : Ex = S 2 in high dimensions, the new procedure is computationally 
fast and adaptive to the unknown dependence structures. Section 2 introduces the new 
testing procedure and investigates its theoretical properties. In Section 3, we compare its 
finite sample performance with several competitive procedures. A gene clustering algorithm 
is derived in Section 4, which aims to group hundreds or thousands of genes based on the 
expression patterns (Sharan et ah, 2002) without imposing restrictive structural assumptions. 
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We apply the proposed procedures to the human asthma dataset in Section 5. Section 6 
discusses our results and other related work. Proofs of the theoretical results and additional 
numerical results are provided in the Supplementary Material. The proposed methods have 
been implemented in the R package HDtest and is currently available on CR.AN (http: 
//cran.r-proj ect.org) . 


2. The new testing procedure 

2.1 The Loo-statistic 

Let X = (AR,..., X p ) T and Y = (Y \,..., Y p ) T be two p-dimensional random vectors with 
means = (/an ,..., /ii p ) T and /x 2 = (/r 2 i, • • •, h 2 P ) T , and covariance matrices Xi= ((Ji )k e)i^k,t^p 
and X 2 = (cr 2i fc£)i^fc,^ P , respectively. We are interested in testing 

Hq -.Hi = X 2 versus Hi : Zb ^ X 2 (2.1) 

based on independent random samples X n = {X 1; ..., X n } and y m = {Y 1; ..., Y m } drawn 

from the distributions of X and Y, respectively. For each i and j , we write X.; = (Xu ,..., Xi p ) T 

and Yj = (Yji,.Y jp ) T . Let S, = (a 1M ) 1<kMp = n" 1 E”=i(X* - X)(X, - X) T and 

X 2 = ((T 2 m) i^kjx,p = m ~ Y)(Y j — Y) T be the sample analogues of 5R and 

S 2 , where X = (X x ,..., X P ) T = n~ l EE X, and Y = (W,..., Y P ) T = m' 1 E™, Y,, 

For each (k,£), a straightforward extension of the two-sample t-statistic for the marginal 

hypothesis H 0yke : a 1M = o 2M versus H lM : a 1M ^ a 2 ,kt is given by 

f _ M ~ ,U / 0 

f'ki / _i /s _i a M/2’ 

(n L si M + m 1 s 2 m) / 

where s 1M = n~ l YYi=\{( x ik’-X k )(X u -Xe)^ai M } 2 and s 2 ,u = m ~ l E^i((M^Y fc )(Y^- 
%)-V2,ktY are estimators of s 1M = Yar{(X k -fi lk )(X l -/a u )} and s 2M = Var{( Y k ~n 2k )(Y £ - 
h 2 t)}, respectively. 

Since the null hypothesis in (2.1) is equivalent to H 0 : maxi^^p \ai tk e — ct 2 ,h| = 0, a 
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natural test statistic that is powerful against sparse alternatives in (2.1) is the Loo-statistic 

L max = max \ikt\. (2.3) 

2.2 A new testing procedure 

One way to base a testing procedure on the Loo-statistic is to reject the null hypothesis (2.1) 
when T^ ax — 4 logp + log(logp) > q a , where q a = — log( 87 r) — 2 log log(l — a ) -1 corresponds 
to the (1 — a)-quantilc of the type I extreme value distribution. Cai et al. (2013) proved that 
this leads to a test that maintains level a asymptotically and enjoys certain optimality. 

In this section, we propose a new test that rejects (2.1) when T max > c a , where c a is 
obtained using a fast-computing data perturbation procedure. The new procedure resolves 
two issues at once. First, it achieves better finite sample performance by avoiding the slow 
convergence of T aiax — 4 log 71 + log(logp) to the type I extreme value distribution. Second and 
more importantly, our procedure relaxes the conditions on the covariance matrices required 
in Cai et al. (2013) (particularly, their Conditions (Cl) and (C3)). Note that their Condition 
(Cl) essentially requires that the number of variables that have non-degenerate correlations 
with others should grow no faster than the rate of p. Although this condition is reasonable 
in some applications, it is hard to be justified for data from the microarray or transcriptome 
experiments, where the genes can be divided into gene sets with varying sizes according to 
functionalities, and usually genes from the same set have relatively high (sometimes very 
high) intergene correlations compared to those from different sets. This corresponds to an 
approximate block structure. Many sets can contain several thousand genes, a polynomial 
order of p. This kind of block structure with growing block size may violate Condition (Cl) 
in Cai et al. (2013). The crux of the derivation of the asymptotic type I extreme value 
distribution in (Cai et ah, 2013) is that the t^s are weakly dependent under H 0 under 
certain regularity conditions. I 11 contrast, the new procedure we present below automatically 
takes into account correlations among the tf-e s. 
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Specifically, we propose the following procedure to compute c a with the dependence among 
tkt s incorporated. 

(I) . Independent of X n and y m , we generate a sequence of independent 7V(0,1) random 
variables g\, ..., g^, where N = n + m is the total sample size. 

(II) . Using the g^s as multipliers, we calculate the perturbed version of the test statistic 

iii> ( 2 - 4 ) 

where t ] u = {n^SxM + ,m) _1/2 (^h - with <*t,ke = n_1 IXi 9i{( X ik ~ x k){ x u ~ 

X ( ) - cr 1M } and a\ M = m _1 Y!j=i9n+j{{Y jk - Y k )(Y j£ - Y e ) - a 2 ,u}- 

(III) . The critical value c a is defined as the upper a-quantile of T^ ax conditional on 
{X n ,y m }; that is, c a = inf |t 6 1 : P g (!Zjt aY > t) ^ aj, where P g denotes the probability 
measure induced by the Gaussian random variables {gi}^ = i with X n and y m being fixed. 

This algorithm combines the ideas of multiplier bootstrap and parametric bootstrap. The 
principle of parametric bootstrap allows f^’s constructed in step (II) to retain the covariance 
structure of t k £ s. The validity of multiplier bootstrap is guaranteed by the multiplier central 
limit theorem, see van der Vaart and Wellner (1996) for traditional fixed- and low-dimensional 
settings and Chernozhukov et al. (2013) for more recent development in high dimensions. 

For implementation, it is natural to compute the critical value c a via Monte Carlo sim¬ 
ulation by Cb )Q = inf(f 6 M. : 1 — F B (t) ^ a}, where F B (t) = TU 1 XlfLi ^ f) and 
Tj ,. ■. ,T' b are B independent realizations of Tj, aY in (2.4) by repeating steps (I) and (II). 
For any prespecified a G (0,1), the null hypothesis (2.1) is rejected whenever T max > c B , a . 

The main computational cost of our procedure for computing the critical value c B)Ct only 
involves generating NB independent and identically distributed 1V(0,1) variables. It took 
only 0.0115 seconds to generate one million such realizations based on a computer equipped 
with Intel(R) Core(MT) i7-4770 CPU @ 3.40GHz. Hence even taking B to be in the order 
of thousands, our procedure can be easily accomplished efficiently when p is large. 
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2.3 Theoretical properties 

The difference between c a and its Monte Carlo counterpart CB, a is usually negligible for a 
large value of B. In this section, we study the asymptotic properties of the proposed test 
= /{T max > c a } under both the null hypothesis (2.1) and a sequence of local alternatives. 

For the asymptotic properties, we only require the following relaxed regularity conditions. 
Let K > 0 be a finite constant independent of n, m and p. 

(Cl). {E(|X fc | 2r )}b /r < Ka hkk , {E(|l').| 2r )} 1 / r ^ Ka 2 ,kk uniformly in k = 1,... ,p, for some r > 4. 

(C2). maxi^fc^p E{exp(KA"|/<Ti i fcfc)} ^ K and rnaxi^^ Ejexp^Y^/o^)} ^ K for some k > 0. 

(C3). Si,ki/{vi,kkG\,u) ^ c and mini^^^p s- 2 ,,kt/(& 2 ,kk& 2 ,u) ^ c for some c > 0. 

(C4). n and m are comparable, i.e. n/m is uniformly bounded away from zero and infinity. 

Assumptions (Cl) and (C2) specify the polynomial-type and exponential-type tails condi¬ 
tions on the underlying distributions of X and Y, respectively. Assumption (C3) ensures that 
the random variables {(X k - /a lk )(X e -nu)}i <k ^ p and {(Y k - H 2 k){Ye~ V 2 e)}i^k,e^ P are non¬ 
degenerate. The moment assumptions, (C1)-(C3), for the proposed procedure are similar to 
Conditions (C2) and (C2*) in Cai et al. (2013). Assumption (C4) is a standard condition in 
two-sample hypothesis testing problems. As discussed before, no structural assumptions on 
the unknown covariances are imposed for the proposed procedure. Theorem 1 below shows 
that, under these mild moment and regularity conditions, the proposed test with c a 
defined in Section 2.2 has an asymptotically a. 

Theorem 1: Suppose that Assumptions (C3) and (C4) hold. If either Assumption (Cl) 
holds with p = 0(n r/2 ~ l ~ 5 ) for some constant 5 > 0 or Assumption (C2) holds with logp = 
o(n 1//7 ), then as n,m —>■ oo, P// 0 ( V I / Q = 1) —>■ a uniformly over a G (0,1). 

Remark 1: The asymptotic validity of the proposed test is obtained without imposing 
structural assumptions on Si and S 2 , nor do we specify any a priori parametric shape 
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constraints of the data distributions, such as Condition A3 in Li and Chen (2012) or 
Conditions (Cl) and (C3) in Cai et al. (2013). 

Next, we investigate the asymptotic power of 'h Q ,. It is known that the Loo-type test 
statistics are preferred to the L 2 -type statistics, including those proposed by Schott (2007) 
and Li and Chen (2012), when sparse alternatives are under consideration. As discussed in 
Section 1, the scenario in which the difference between Xh and S 2 occurs only at a small 
number of locations is of great interest in a variety of scientific studies. Therefore, we focus 
on the local sparse alternatives characterized by the following class of matrices 

= {(£i,E 2 ) : Si and E 2 are positive semi-definite matrices satisfying 

Assumption (C3) and max -— \ aiM — °jcj_ I—^ (logp ) 1//2 7 l. 

(n-^i,*/ + m- 1 s 2 ,ki ) y2 J 

Theorem 2 below shows that, with probability tending to 1, the proposed test is able to 
distinguish H 0 from the alternative H i whenever (Ej, S 2 ) G 7 ) for some 7 > 2. 

Theorem 2: Suppose that Assumptions (C3) and ( Cf ) hold. If either Assumption 1 
holds with p = 0(n r ’ //2 ~ 1 ~ <5 ) for some constant 5 > 0 or Assumption 2 holds with logp = 
o(n 1//2 ), then as n,m —> 00 , inf(s 1 ,s 2 )eAr( 7 ) I E Ri( v I'a = 1 ) —» 1 for any 7 > 2 . 

Theorem 2 of Cai et al. (2013) requires 7 = 4 to guarantee the consistency of their proce¬ 
dure. Moreover, they showed that the rate (logp) 1 ,/ 2 'n ~ 1//2 for the lower bound of the maximum 
magnitude of the entries of Sh — E 2 is minim ax optimal, that is, for any a,/3 > 0 satisfying 
a + (3 < 1, there exists a constant 70 > 0 such that inf(s 1 ,s 2 )eAi( 7 o) su PT a eT Q {T a — 1) ^ 
1 — f3 for all sufficiently large n and p, where T a is the set of a-lcvcl tests over the collection 
of distributions satisfying Assumptions (Cl) and (C2). Hence, our proposed test also enjoys 
the optimal rate and is powerful against sparse alternatives. 
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3. Simulation studies 

In this section, we compare the finite-sample performance of the proposed new test with 
that of several alternative testing procedures, including Schott (2007) (Sc hereafter), Li 
and Chen (2012) (LC hereafter) and Cai et al. (2013) (CLX hereafter). We generated two 
independent random samples {Xj }™ =1 and {Yj }™ =1 such that X, : = EjifZ - 1 ' 1 and Yj = 
sf, 2 zf with Z™ = .... 4‘>) T and zf = , Z®) T , where zg\ .... z£> and 

/o\ 

Z), ,..., Zj p are two sets of independent and identically distributed (i.i.d.) random variables 
with variances a 2 zi and cr| 2 , such that Ex = a zl 'E 1 ^ and E 2 = cr| 2 E 2i *. We assess the 
performance of the aforementioned tests under the null hypothesis (2.1). Let E^* = S 2) * = 
E* and consider the following four different covariance structures for E*. 


• Ml (Block diagonals): Set E* = D 1 / 2 AD 1 / 2 , where D is a diagonal matrix whose diagonals 
are i.i.d. random variables drawn from Unif(0.5, 2.5). Let A = (ake)i<^k/^p, where a k k = 1, 
aki = 0.55 for 10(g — 1) + 1^A;^^^ 10g for q — 1,..., [p/10j, and a k e = 0 otherwise. 

• M2 (Slow exponential decay): Set E* = (cr ke,*)i^k/^ P , where a k i,* = 0.99l fc_£ l 1/3 . 

• M3 (Long range dependence): Let E* = {(Jki,*)i^k,e^ P with i.i.d. <Jkk,* ~ Unif(l,2), and 
&k£,* = Pa(\k ~ ^\)i where p a (d ) = {(d + 1) 2H + (d — 1) 2H — 2d 2H }/2 with H = 0.85. 

• M4 (Non-sparsity): Define matrices F = (/h)km<p with fkk = 1, fk,k+i = fe+i,i = 0.5, 
U ~ W(V p ,fco), the uniform distribution on the Stiefcl manifold (i.e. U G W xk ° and U T U = 
I ko , the ko -dimensional identity matrix), and diagonal matrix D with diagonal entries being 
i.i.d. Unif(l,6) random variables. We took k 0 = 10 and E* = D 1 / 2 (F + UU T )D 1//2 . 

In practice, non-Gaussian measurements are particularly common for high throughput data, 
such as data with heavy tails in microarray experiments and data of count type with zero- 
inflation in image processing. To mimic these practical scenarios, we considered the following 
three models of innovations and Zjf to generate data. 

• (Dl) Let Z-}J and Zj 2 J be Gamma random variables: Z^\ Z^ ~ Gamma(4,10). 
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• (D2) Let Z-jj and Z^ be zero-inflated Poisson random variables: Z^\ Z^ ~ Pois(lOOO) 
with probability 0.15 and equals to zero with probability 0.85. 

• (D3) Let z\l ) and Z^ be Student’s t random variables: Z^ ~ t 5 and Z^j ~ t 5 (/x) with 
non-central parameter /i drawn from Unif(—2,2). 

For the numerical experiments, {ri\ ,n 2 ) was taken to be (45,45) and (60,80), and the 
dimension p took value in {80, 280, 500,1000}. To compute the critical value for the proposed 
test B was taken to be 1500. 

[Table 1 about here.] 

[Table 2 about here.] 

Tables 1 and 2 display the empirical sizes of T s,a , the LC test, Sc test and CLX test. 
For both the Gamma and zero-inflated Poisson data (models D1 and D2), the Sc test fails 
to maintain the nominal size while the other three tests maintain the significance level 
reasonably well. For the t-distributed data (model D3), both the Sc and LC tests had 
distorted empirical sizes. In contrast, the proposed test has empirical size closer to 

the nominal level for the t-distributed data while the CLX test is more conservative. This 
confirms the early discussions that the limiting distribution based approach for Loo-type 
test procedure can sometimes be conservative. Compared to the existing methods, 
has a much wider applicability as it requires no structural assumptions on the unknown 
covariances and circumvents the issue of slow convergence of Loo-type statistic to its limiting 
distribution. Overall, 4^,0 maintains the nominal size in finite sample reasonably well and 
is robust against unknown covariance structures as well as data generation mechanisms. 

To evaluate the power performance against relatively sparse alternatives, we define a 
perturbation matrix Q with |_0.05pJ random non-zero entries. Half of the non-zero entries 
are randomly allocated in the upper triangle part of Q and the others are in its lower 
triangle part by symmetry. The magnitudes of non-zero entries are randomly generated from 
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Unif(r/2, 3r/2) with r = 8 max{max ls ; fc ^p cr fcfci „, (logp) 1//2 }, where cr fcfe) *’s are the diagonal 
entries of E* specified before. We take Ei,* = E* + AoI p and E 2) * = E* + Q + AoI p , where 
Ao = | min{A m i n (E* + Q), A m i n (E*)}| +0.05 with A min (A) denoting the smallest eigenvalue of 
matrix A. For the Gamma and zero-inflated Poisson data (panels for D1 and D2 in Figure 1), 
only the proposed test \1 / B , a , the LC and CLX tests are considered since the Sc test is no 
longer applicable due to inflated sizes; and similarly, for the t-distributed data (panels for 
D3 in Figure 1), only T B , a and the CLX test are considered. 

[Figure 1 about here.] 

Figure 1 displays empirical power comparisons. We see that the proposed test T Bi0l and 
the CLX test are substantially more powerful than the LC test against sparse alternatives 
for the Gamma and zero-inflated Poisson data (data models D1 and D2) under different 
covariance structures. As the number of non-zero entries of Si — S 2 grows in p, both the 
proposed test + Ba and the CLX test gain powers while the LC test do not gain much due to 
the sparsity of Ei — S 2 . For the Gamma and zero-inflated Poisson data, the proposed test 
is slightly more powerful than the CLX test when the sample size is small and the two tests 
are closely comparable as the sample size increasing. For t-distributed data (data model D3), 
^ 3,0 is more powerful than the CLX test and gains more powers along increasing sample 
sizes and dimensions. In summary, ^ B ,a outperforms the other three for sparse alternatives 
of interest. More simulation results are reported in the supplementary material. 

4. Application of the proposed procedure in gene clustering 

The primary goal of gene clustering is to group genes with similar expression patterns 
together, which usually provides insights on their biological functions or regulatory pathways. 
In genomic studies, gene clustering has been employed for detecting co-expression gene sets 
(D’haeseleer, 2005; Sharan et ah, 2002), identifying functionally related genes (Yi et ah, 


Testing Large Covariance Matrices 


11 


2007), and discovering large groups of genes suggestive of co-regulation by common factors, 
among other applications. 

Consider a random sample X n = {X],..., X n } of n independent observations from X = 
(Xi,... ,Xp) T with covariance Si = (cri,*^)^/^ and correlation Ri = (pi,ke)i^kM P , where 
Xj records the expression levels of p genes from subject i. To cluster the genes based on their 
expression levels, some dissimilarity or proximity measure for the p genes, or equivalently, 
the p variables, is calculated based on X n , to which clustering algorithms are applied. Gene 
clustering can therefore be achieved via clustering the variables. To discover the clustering 
structure of variables, it is intuitive that variables X k and X? will be clustered in the same 
group if \pi,kt\ is large and separated otherwise (Wagaman and Levina, 2009). Specifically, 
if there are some clustering structures among variables, then there exists a partition of 
{l,...,p} upon potential permutations, denoted by {Bf}^ for some 1 ^ m ^ p, such 
that mm k / e Bt \pi,ki\ > Fl, and for an y 1 ^ t ^ t' ^ m, ma x ke B t /eB t , \pi,ki\ < c 2, where 
Ci,C 2 > 0 are positive constants. The problem is then closely related to testing one-sample 
hypotheses that for a given A C Z p = {(1,1),..., (l,p), (2,1),..., (2 ,p ),..., (p,p)}, Hfr : 
piM = 0 for any (k, i) G A versus : p\g-t ^ 0 for some (k, £) E A, which is equivalent to 

Hq : cr\ y kt = 0 for any (k, f) 6 A versus : a^ki 7^ 0 for some (k, G A. (4.1) 

Testing the hypothesis (4.1) facilitates recovering the dissimilarity patterns among variables; 
that is, failing to reject Hq indicates the segregation between X k and X( whenever (k, £) G A. 

Motivated by the block-wise estimation method of Caragea and Smith (2007), we define A 
in the following way. First, we place the covariance matrix on a p x p grid indexed by X p 
and partition it with blocks of moderate size. Due to symmetry, we only focus on the upper 
triangle part. Second, we construct blocks of size so x so along the diagonal and note that 
the last block may be of a smaller size if so is n °f a divisor of p. Next, we create new blocks 
of size So x So successively toward the top right corner. Similarly as before, blocks to the 
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most right may be of smaller size. The grid, or equivalently, the index set X p , is partitioned 
into S = \p/ So ](\p/s 0 ] + l)/2 sub-regions and we denote by Ai,..., A 5 the partition of the 
upper triangle indices {(k,£) : 1 ^ k < £ ^ p}. 

On each of the sub-regions, we modify the proposed procedure for testing local hypotheses 
Ho s '■ &i,kt = 0 for any (k, £) G A s versus H± s : 7 ^ 0 for some (k, £) G A s , s = 1 ,S. 

We then apply the Benjamini-Hochberg (BH) procedure to control the false discovery rate 
(FDR) for simultaneously testing S hypotheses. For each s, failing to reject the null Hq s 

indicates a segregation between X k and Xp for (k, £) G A s and zero will be assigned as the 

similarity between X k and Xp. We summarize this procedure as follows. 

(I) Compute the sample covariance matrix Si = (< 71 ^) 1 ^k,e^ P and T = (ike)i^.k,e^p, where 

tu = o\ y u for s\ y u defined in Section (2.2). 

(II) Independent of X n , simulate a sample of size R, where for each b = 1 ,,B and 

1 ^ k ^ ^ p, compute V b M = (n _1 s i )k p)~ 1/2 Y^=i 9 b,i{{X ik - X k )(X u - Xp) - cr 1M }, where 
{gb,ii ■ ■ ■ j gb,n} is a sequence of i.i.d. standard normal random variables. 

(III) Partition the p x p grid as discussed before by S blocks. For each block with entries 
indexed by A s C Z p , compute the approximated p-value as p s — 1 — Fb (max(^ e A, tu) , 
where F b denotes the empirical (conditional) distribution function of max^^gA, t k p given 
X n using the simulated samples (max^gA, 

(IV) Estimate the g-values for {p s }f =1 using the BH procedure, denoted by {q s }. For a 
prespecified cut-off n, define the dissimilarity measure by 

d k p = 1- / H ^ s < n ~ -- for any (k,£) G A s . (4.2) 

max{max (M)eAs t k p, 1 } 

Based on the measure in (4.2), we can apply clustering algorithms such as the hierarchical 
clustering for clustering variables and obtain gene clustering. To specify the blocks, we 
propose the following data-driven selection of s 0 - The S local hypotheses to be tested 
simultaneously admit unknown complex dependencies so that the FDR, controlled by the 



Testing Large Covariance Matrices 


13 


BH procedure, satisfies the general upper bound FDR ^ (7rS'olog5)/5 l where Sq denotes 
the number of true null local hypotheses (Benjamini and Yekutieli, 2001). To control the 
FDR at the nominal level 7 r, we need S ^ Sq log S which is automatically satished when 
S' = 1 or so is large. Therefore, we define a data-driven so by so = max{ [logp], min(s : 
S'o(s) ^ S'(s)[log{S'(s)}] -1 )}, where S'(s) = \p/s](\p/s] +1)/2 and Sq is an estimate for the 
number of true null local hypotheses. In practice, we may also reorder the variables first using 
methods such as the Isoband algorithm by Wagaman and Levina (2009). A demonstration of 
the proposed clustering algorithm, as well as comparisons of du with traditional dissimilarity 
measures based on the human asthma data, is displayed in the Supplementary Materials. 

5. Application to analysis of human asthma data 

5.1 Background 

As a common chronic inflammatory disease of the airways, asthma is caused by a combination 
of complex genetic and environmental interactions and affects more than 200 million people 
worldwide as of 2013 as shown in 2013 World Health Organization Fact Sheet No. 307. The 
mechanism and regulatory pathways remain unclear. We illustrate the proposed new proce¬ 
dures using the human asthma data from the microarray experiment reported by Voraphani 
et al. (2014), which was aimed to understand the regulatory pathway and mechanism for 
high nitrative stress, a major characteristic of human severe asthma. Voraphani et al. (2014) 
identified several novel pathways, including discovering that the Thl cytokine, IFN-y, along 
or with Th2 regulations, are critical immune agents for the disease development by amplifying 
epithelial NAD/NADPH thyroid oxidase expression and aiding the production of nitrite. 

The original microarray gene expression data are available at the NCBI’s Gene Expression 
Omnibus database with the Gene Expression Omnibus Series accession number GSE43696. 
The data consist of ri\ = 20 health samples and 77,2 = 88 patients suffering from moderate or 
severe asthmatics. We focused on identifying disease-associated GO terms. After preliminary 
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filtering steps using the approach in Gentleman et al. (2005) and removing genes without 
appropriate annotations, there remained 24, 520 genes. We excluded GO terms with missing 
information or less than 10 genes. There retained 3, 290 GO terms from the original dataset 
whose sizes vary from 11 to 8, 070 genes. For g — 1,..., G with G = 3, 290, denote by pt hg 
and n a the mean gene expression levels, and S h, g and S ajff the covariance matrices for the 
g th GO term in the control and disease groups, respectively. 

5.2 Differential expression analysis 

A commonly used method in differential analysis is the mean-based test that selects interest¬ 
ing GO terms by testing the null hypothesis that overall gene expressions within a GO term 
are similar across populations (Chen and Qin, 2010; Chang et ah, 2014; Wang et ah, 2015). 
Though the mean-based procedure has been successful in detecting differential expressed 
genes based on the changes in the expression level, recent developments in genomic analysis 
have revealed the importance to detect genes with changing relationships with other genes in 
different biological states, and particularly GO terms that change the dependence structures 
across populations (de la Fuente, 2010). The discovery of those GO terms with altered 
dependence structures provides information on critical gene regulation pathways. Consider 
all the GO terms, we applied the proposed method T B>a to test the global hypotheses 

Hq g ■ ^ h,g = S a,g VerSUS H lg \ X h,g % ^a,g- (5-1) 

For a comparison, we also applied the LG and CLX tests. 

Here, B — 5, 000 Monte Carlo replications were employed to compute the p -values for T B , a - 
By controlling the FDR at 2.5% (Benjamini and Yekutieli, 2001), the proposed test T B ,a 
declared 969 GO terms significant while the LG and CLX tests declared 290 and 524 GO 
terms significant, respectively. The proposed test has found more significant GO terms 
and is less conservative than the others, which is also reflected by the histograms of p-values 
for the three tests displayed in the Supplementary Material. Table 3 displays the top 15 most 
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significant GO terms declared by '&s,a and also highlights those GO terms that were not 
detected by the LG and CLX tests. For example, G0:0005887 (integral to plasma membrane) 
is functionally relevant to the dual oxidases (DUOX2)-thyroid peroxidase interaction and is 
important to the mechanism of asthma development (Voraphani et ah, 2014). It is worth 
noticing that 'll b,o is able to discover this biologically important GO term that is missed by 
the others. This further highlights the good performance of our proposed test. 

[Table 3 about here.] 

In addition, we compared the study on changing intergene relationships across biological 
states with the traditional differential analysis based on mean expression levels. The proposed 
test on intergene relationships discovered 268 significant GO terms that were missed by the 
traditional differential analysis. This reflects the lately growing demands on analyzing gene 
dependence structures. More details on this comparison are retained in the supplement. 

5.3 Gene clustering study on GO terms of interest 

Voraphani et al. (2014) revealed a novel pathway involving epithelial iNOS, dual oxidases, 
TPO and the cytokine INF -7 to understand the mechanism of human asthma. Multiple 
transcripts, together with their variants, are related, while their co-regulation mechanisms are 
less clear. The proposed gene clustering algorithm provides a way to study gene interactions. 

For illustration, we focus on the GO terms that were declared significant via testing (5.1) 
and are related to IFN -7 or TPO, and apply our clustering procedure to the sample from 
the health and disease groups separately to study how the gene clustering alters across two 
populations. For IFN- 7 , we consider the GO terms 0032689 (negative regulation of IFN -7 
production), 0060333 (IFN- 7 -mediated signaling pathway) and 0071346 (cellular response 
to IFN- 7 ). For TPO, the GO terms have been considered include 0004601 (peroxidase ac¬ 
tivity), 0042446 (hormone biosynthetic process), 0035162 (embryonic hemopoiesis), 0006979 
(response to oxidative stress), and 0009986 (cell surface). Their sizes vary from 17 to 439. 
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[Figure 2 about here.] 

We take B = 5, 000, a = 0.05 and use hierarchical clustering algorithm with average 
linkage. The S$ is estimated using the censored Beta-Uniform mixture model by Markitsis 
and Lai (2010) for selecting block size So- Figures 2-3 display comparisons of gene clustering 
between the health and disease groups (more comparisons are included in the Supplementary 
Material). Each vertex in the figures represents a gene or its variant and is labelled by the 
corresponding ID. Vertexes connected by edges in gray are clustered into one group, and 
vertexes in red and yellow belong respectively to the maximum clique in the health and 
disease groups. Vertexes in both colors belong to the maximum cliques for both groups. 

From Figure 2 we see that for G0:0071346, regarding the cellular response to INF- 7 , genes 
tend to function more in clusters in the asthma group than those in the health group. Gene 
TLR3 actively appears in the largest gene clusters for both the health and asthma groups, 
while gene IL18 is isolated in the asthma group. Gene NOS2 is involved in asthma by 
co-regulating with ARG2. These suggest that these four genes are important signatures for 
understanding the effect of INF -7 on the asthma progression. Regarding the INF- 7 -mediated 
signaling pathway, Figure 2 also shows that compared to the health group, genes seem to 
preferentially function separately in the asthma group. The original dominating gene clusters 
are broken into small groups in the presence of the disease. The different configurations in 
primary gene clusters between the health and asthma groups for G0:0060333 provide further 
information on how INF -7 influences the iNOS pathway. For the critical enzyme TPO, Figure 
3 shows that genes also tend to function in clusters in the disease group. In the presence of 
asthma, the gene cluster HBB-HBA2.1-HBA2 is preserved and the gene IPCEF1 is isolated 
from the original largest gene cluster for G0:0004601. It is interesting to notice that the 
DUOX2 genes are isolated in the health group but do interact with many genes, particularly 
with TPO, in the presence of asthma as documented in Voraphani et al. (2014). The identified 
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DU0X2 gene cluster provides a candidate pathway to understand how TPO catalyzes the 
iNOS-DUOX2-thyroid peroxidase pathway discovered by Voraphani et al. (2014). Last but 
not least, it can be seen from Figure 3 that the overall co-regulation patterns remain similar 
across populations, while those of TPO alters in the presence of asthma. 

In summary, based on the proposed procedure, not only can we test the difference in gene 
dependence, we can also discover the disparity in gene clustering, which reflects the difference 
in gene clustering patterns between the health and disease groups. 

[Figure 3 about here.] 


6. Conclusion and discussion 

In this paper, we proposed a computationally fast and effective procedure for testing the 
equality of two large covariance matrices. The proposed procedure is powerful against sparse 
alternatives corresponding to the situation where the two covariance matrices differ only in 
a small fraction of entries. Compared to existing tests, the proposed procedure requires no 
structural assumptions on the unknown covariance matrices and remains valid under mild 
conditions. These appealing features grant the proposed test a vast applicability, particu¬ 
larly for real problems arising in genomics. As an important application, we introduced a 
gene clustering algorithm that enjoys the same nice feature of avoiding imposing structural 
assumptions on the unknown covariance matrices. 

Another interesting and related problem is testing the equality of two precision matrices, 
which was recently studied by Xia et al. (2015). In the literature of graphical models, it is 
common to impose the Gaussian assumption on data so that the conditional dependency 
can be inferred based on the precision matrix. When the discrepancy between two precision 
matrices is believed to be sparse, the data-dependent procedure considered in this paper 
can be extended to comparing them by utilizing the similar Loo-type statistic discussed in 
Xia et al. (2015). It is interesting to investigate whether our method can be applied to 
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testing precision matrices in the presence of heavy-tailed data, which is often modeled by 
the elliptical distribution family. We leave this to future work. 

7. Supplementary Materials 

Web Appendices, which include proofs of the main theorems and additional numerical results 
referenced in Sections 2, 3 and 5 are available with this paper on the Biometrics website on 
Wiley Online Library. 
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(a) Covariance structure Ml 





(b) Covariance structure M2 



(c) Covariance structure M3 
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(d) Covariance structure M4 


Figure 1: Comparison of empirical powers for data generated by data models D1-D3 with 
different covariance structures. In each panel, horizontal and vertical axes depict dimension p 
and empirical powers, respectively; and unbroken lines and dashed lines represent the results 
for ( 711 , 712 ) = (45,45) and (60,80), respectively. The different symbols on the lines represent 
different tests experimented in the study, where o, □, and + indicate the proposed test, 
tests by Li and Chen (2012) and Cai et al. (2013), respectively. Results are based on 1000 
replications with a = 0.05. 
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Control-Group G0:0071346 


Asthma-Group G0:0071346 




(a) G0:0071346, cellular response to INF -7 
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(b) G0:0060333, INF- 7 -mediated signaling pathway 


Figure 2: Comparison of clustering structures of G0:0071346, cellular response to INF- 
7 and G0:0060333, INF- 7 -mediated signaling pathway, between health and disease groups 
using the proposed gene clustering procedure. This figure appears in color in the electronic 
version of this article. 
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Control-Group G0:0004601 


Asthma-Group G0:0004601 




(a) G0:0004601, peroxidase activity 


Control-Group G0:0035162 




Asthma-Group G0:0035162 


(b) G0:0035162, embryonic hemopoiesis 


Figure 3: Comparison of clustering structures of G0:0004601, peroxidase activity and 
G0:0035162, embryonic hemopoiesis, between health and disease groups using the proposed 
gene clustering procedure. This figure appears in color in the electronic version of this article. 



















Table 1: Empirical sizes of the proposed test T b j± along with those of the tests by Li and Chen (2012) (LC), Schott (2007) (Sc), 
and Cai et al. (2013) (CLX) for data generated by data models D1-D3 with covariance structures Ml and M2. Results are based 
on 1000 replications with a = 0.05, (ni,n 2 ) = (45,45) and (60,80). 
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Table 2: Empirical sizes of the proposed test T b, 0 along with those of the tests by Li and Chen (2012) (LC), Schott (2007) (Sc), 
and Cai et al. (2013) (CLX) for data generated by data models D1-D3 with covariance structures M3 and M4. Results are based 
on 1000 replications with a = 0.05, (n!,n 2 ) = (45,45) and (60,80). 
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Tabic 3: Top 15 most significant GO terms detected by T B, a with FDR controlled at 2.5%, 
b and f refer to the GO terms not being declared significant by the CLX test and the LC 
test, respectively. 


GO ID 

GO term name 

G0:0006886 

intracellular protein transport 1 

G0:0008565 

protein transporter activity 1 

G0:0030117 

membrane coat 1 

G0:0005515 

protein binding^ 

G0:0016032 

viral reproduction^ 

G0:0005829 

cytosod 

G0:0000278 

mitotic cell cycle! 

G0:0006334 

nucleosome assembly! 

G0:0034080 

CenH3-containing nucleosome assembly at centromere 

G0:0006974 

response to DNA damage stimulus! 

G0:0016874 

ligase activity! 

G0:0032007 

negative regulation of TOR signaling cascade! 

G0:0005887 

integral to plasma membrane 1 '’! 

G0:0006997 

nucleus organization! 

G0:0030154 

cell differentiation! 



